Generalized Additive Models in Fraud Detection

Data Science Capstone Project

Grace Allen, Kesi Allen, Sonya Melton, Pingping Zhou

2025-11-21

Introduction

What are generalized additive models?

  • Not your typical straight-line regression — GAMs let patterns curve naturally

  • Great at uncovering hidden trends in messy real-world data

  • Each feature gets its own shape, showing where risk rises or falls

  • Makes the model’s behavior easy to explain to non-technical teams

  • Perfect for fraud detection, where small pattern changes matter

Brief History of GAMs

Generalized Additive Models were introduced in the late 1980s as a way to add flexibility to traditional regression models. Hastie and Tibshirani (1986, 1990) developed the framework to allow each predictor in a model to follow its own smooth pattern rather than forcing everything into a straight line. Through the 1990s and early 2000s, the approach grew in popularity in fields that needed interpretable models, including public health, ecology, and the social sciences.

A major step forward came with the development of the mgcv package in R, created by Simon Wood (Wood, 2017). His work added modern smoothing techniques, automatic penalty selection, and faster computation, making GAMs practical for large and noisy datasets (Wood, 2025). Today, GAMs are widely used in finance, fraud detection, risk scoring, and other areas where organizations need both predictive accuracy and clear explanations (Dal Pozzolo et al., 2014; DGAM, 2021; Gam.hp, 2020).

GAMs in Action: Real-World Uses and Our Study

GAMs help uncover nonlinear relationships and subtle patterns across diverse domains, from public health and ecology to finance and fraud detection.

Our Project: GAMs for Fraud Detection

  • Toolset: RStudio + the mgcv package

  • Dataset: Kaggle’s Fraud Detection Transactions (Ashar, 2024)

  • Purpose: Identify predictive variables linked to fraudulent activity

  • Context: Synthetic but realistic data for controlled testing

Here’s how we used GAMs to explore patterns in the fraud dataset.

Methods

GAM Modeling Overview

  • GAMs extend traditional regression (Hastie & Tibshirani, 1986)

  • Capture nonlinear predictor-response relationships

  • Use spline-based smooth functions

  • Combine continuous + categorical predictors

  • Fit with mgcv (penalized splines + GCV)

  • Model outputs interpretable smooth effects

  • Goal: Estimate probability of fraud

Modeling Workflow Steps

  1. Acquire synthetic Kaggle fraud dataset (50k rows, 21 features)
  2. Explore distributions and identify skew/outliers
  3. Clean data and review categorical variables
  4. Summarize transaction type, device type, and merchant category
  5. Visualize patterns using histograms and nonlinear trend checks

  6. Check assumptions with k-index, QQ plot, and residual diagnostics
  7. Fit GAM and interpret numeric and categorical smooth effects
  8. Evaluate performance using confusion matrix, ROC curve, and AUC (0.73)
  9. Summarize findings and finalize interpretation
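
The AUC used in the evaluation step above has a simple interpretation: the probability that a randomly chosen fraud case receives a higher model score than a randomly chosen legitimate transaction. A minimal stdlib-only Python sketch (the project itself used R; the scores below are toy values, not model output):

```python
# Rank-based AUC (equivalent to the Mann-Whitney U statistic).
# Toy scores for illustration only -- not output from the fitted GAM.

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative,
    counting ties as half a win."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores: 3 fraudulent and 3 legitimate transactions.
fraud_scores = [0.9, 0.8, 0.4]
legit_scores = [0.5, 0.3, 0.2]
print(round(auc(fraud_scores, legit_scores), 3))  # prints 0.889
```

An AUC of 0.5 would mean the scores carry no ranking information; 1.0 would mean perfect separation of the two classes.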

GAM Equation

\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]

  • g = link function (logit for binary fraud) (Shalizi, 2012)

  • Smooth functions capture nonlinear effects

  • Additive contributions from each predictor

  • Balances flexibility + interpretability (Hastie & Tibshirani, 1990)
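
To make the equation concrete, here is a small Python sketch (illustrative only: the smooth functions s1 and s2 and the intercept are made-up shapes, not the fitted splines from this study). Each predictor contributes through its own smooth function, the contributions add on the logit scale, and the inverse link maps the total to a probability:

```python
import math

def s1(x):
    """Hypothetical smooth effect of one predictor (e.g., amount)."""
    return 0.5 * math.tanh(x - 2.0)

def s2(x):
    """Hypothetical smooth effect of another predictor (e.g., risk score)."""
    return 1.5 * x ** 2

def fraud_probability(x1, x2, alpha=-3.0):
    """g(mu) = alpha + s1(x1) + s2(x2), with a logit link (alpha is illustrative)."""
    eta = alpha + s1(x1) + s2(x2)        # additive linear predictor
    return 1.0 / (1.0 + math.exp(-eta))  # inverse logit -> probability

# Riskier inputs raise the linear predictor and hence the probability.
print(fraud_probability(1.0, 0.2))
print(fraud_probability(3.0, 0.9))
```

Because the effects are additive, each smooth can be plotted and interpreted on its own, which is exactly the interpretability advantage the bullets above describe.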

GAM Assumptions (Fraud Context)

  • Logit link approximates fraud probability

  • Additive and independent predictor effects

  • Smooth, gradual functional relationships

  • Binomial response distribution

  • Independent observations (Wood, 2017)

  • Low predictor multicollinearity

  • Penalization prevents overfitting (Wood, 2017)

Why We Chose GAMs For Fraud Detection

  • Captures nonlinear fraud patterns

  • Handles rare, imbalanced outcomes

  • Produces interpretable smooth risk curves

  • Supports regulatory transparency

  • Balances accuracy + interpretability

  • Strong literature support for fraud analytics

  • Scalable through mgcv’s automated smoothing (Wood, 2017)

Practical Advantages & Relevance to Real-World Analytics

  • Supports investigative decision-making

  • Shows monotonic or nonlinear risk curves

  • Can benchmark or surrogate black-box models

  • Helps surface suspicious transactions for review

  • Useful for auditors, fraud teams, analysts

  • Aligns with both operational and compliance needs

Analysis and Results

Data Exploration and Visualization

Dataset Description

What It Is

  • A synthetic dataset built to mimic real financial transactions

  • Privacy‑safe: no real people’s data used

  • Hosted on Kaggle

Why We Use It

  • Train fraud detection models for binary classification tasks

  • Spot fraud: each transaction labeled as fraud (1) or not fraud (0)

What Makes It Special

Realistic fraud patterns:

  • Groups of fraudulent transactions

  • Subtle, hard‑to‑notice anomalies

  • Odd user behaviors

  • Large & diverse records: balances normal vs. rare fraud cases → addresses class imbalance.

Dataset Key Characteristics

What’s Inside

  • 50,000 rows: a sizable sample for training and evaluating models.

  • Two labels: every transaction is marked either 1 (fraud) or 0 (not fraud).

Data Features

21 features across three categories:

  • Numbers: Like transaction amounts, risk scores, account balances.

  • Categories: Transaction types (payment, transfer, withdrawal), device types, merchant categories.

  • Time Data: When transactions happened (time, day) and their sequence.

Label Distribution and Class Imbalance

  • Fraudulent transactions are a small percentage, reflecting real-world scenarios.

  • Behavioral Realism: Includes unusual spending, behavioral signals, and high-risk profiles.

  • Modeling flexibility: supports interpretable (GAMs, logistic regression) or high-performance (XGBoost) approaches

Dataset Visualizations

Categorical Distribution

Table 1 – Transaction Types and Counts

  Type             Count
  POS              12,549
  Online           12,546
  ATM Withdrawal   12,453
  Bank Transfer    12,452

Table 2 – Device Types and Counts

  Device   Count
  Tablet   16,779
  Mobile   16,640
  Laptop   16,581

Table 3 – Merchant Categories and Counts

  Merchant_Category   Count
  Clothing            10,033
  Groceries           10,019
  Travel              10,015
  Restaurants          9,976
  Electronics          9,957

Distribution of Variables

Card Age

Non-linearity Check

Modeling and Results

Assumptions

GAM Analysis for Numeric Variables

GAM Analysis for Categorical Variables

  Transaction_TypeBankTransfer: p-value 0.498, OR_low 0.931, OR_high 1.035
  Transaction_TypeOnline: p-value 0.546, OR_low 0.933, OR_high 1.037
  Merchant_CategoryElectronics: p-value 0.734, OR_low 0.952, OR_high 1.072
  Merchant_CategoryGroceries: p-value 0.531, OR_low 0.960, OR_high 1.082

None of these categorical effects is statistically significant: every p-value is well above 0.05 and every odds-ratio interval contains 1.
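
Assuming OR_low and OR_high bound a 95% confidence interval for the odds ratio, this significance screen can be written as a one-line rule: flag a level only when its p-value is below 0.05 and its interval excludes 1. A small Python check using the values reported above:

```python
# Values copied from the slide: (p-value, OR_low, OR_high) per level.
results = {
    "Transaction_TypeBankTransfer": (0.498, 0.931, 1.035),
    "Transaction_TypeOnline":       (0.546, 0.933, 1.037),
    "Merchant_CategoryElectronics": (0.734, 0.952, 1.072),
    "Merchant_CategoryGroceries":   (0.531, 0.960, 1.082),
}

def significant(p_value, or_low, or_high, alpha=0.05):
    """True only if p < alpha and the odds-ratio interval excludes 1."""
    return p_value < alpha and not (or_low <= 1.0 <= or_high)

flagged = [name for name, vals in results.items() if significant(*vals)]
print(flagged)  # prints [] -- no categorical level reaches significance
```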

GAM Model for Key Predictor

GAM Equation for Key Predictor

GAM equation structure:

\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]

Our model simplifies to a single predictor:

\[ \text{logit}(\Pr(\text{Fraud} = 1)) = \alpha + s(\text{Risk_Score}) \]

where \(\alpha = 1.9109\) is the intercept. Because mgcv centers each smooth term, this is the baseline log-odds of fraud when s(Risk_Score) contributes zero (its average effect), not the log-odds at Risk_Score = 0.
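
As a quick numerical check, the intercept converts to a probability through the inverse logit. The Python sketch below (illustrative; the model itself was fit in R) uses the reported value as given:

```python
import math

alpha = 1.9109  # intercept reported above

def inv_logit(eta):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

print(round(inv_logit(alpha), 3))  # prints 0.871
```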

Model Diagnostics and Performance Metrics

Conclusion

Key Findings & Model Performance

What We Found:

  • Risk Score: Strongest predictor — fraud probability spikes near 0.75

  • Transaction Amount: Moderate effect — larger transactions more likely to be fraudulent

Model Performance

  • True Positives: 2,318

  • True Negatives: 10,102

  • False Positives: 77

  • False Negatives: 2,502

  • AUC (ROC Curve): 0.73 — acceptable discriminative ability
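
The standard metrics follow directly from these four counts. A quick Python check (values taken from the bullets above) shows precision is high, meaning few false alarms, while recall is below 50%, consistent with the class-imbalance limitation discussed next:

```python
# Confusion-matrix counts reported above.
tp, tn, fp, fn = 2318, 10102, 77, 2502

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall correct rate
precision   = tp / (tp + fp)                   # flagged cases that are fraud
recall      = tp / (tp + fn)                   # fraud cases actually caught
specificity = tn / (tn + fp)                   # legitimate cases cleared

print(round(accuracy, 3), round(precision, 3),
      round(recall, 3), round(specificity, 3))
# prints 0.828 0.968 0.481 0.992
```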

Limitations

  • Synthetic dataset limits real-world generalizability

  • Class imbalance affects recall for rare fraud cases

  • More complex models may improve accuracy

Insights & Next Steps

  • Combine GAMs with other ML methods

  • Test on real-time or streaming data

  • Refine thresholds for cost-sensitive decisions

QUESTIONS?

References

Ashar, S. (2024). Fraud detection transactions dataset. Kaggle. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
Brossart, D. F., Clay, D. L., & Willson, V. (2015). Detecting contaminated birthdates using generalized additive models. BMC Bioinformatics, 16(185), 1–9. https://doi.org/10.1186/s12859-015-0636-0
Dal Pozzolo, A., Caelen, O., Le Borgne, Y. A., Waterschoot, S., & Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915–4928. https://doi.org/10.1016/j.eswa.2014.02.026
Detmer, A. (2025). Ecological thresholds and generalized additive models. Journal of Ecology Research, 45(3), 215–230.
DGAM. (2021). Dynamic generalized additive models (DGAMs) for forecasting. PeerJ, 9, e10974. https://doi.org/10.7717/peerj.10974
Gam.hp. (2020). Evaluating the relative importance of predictors in generalized additive models using the gam.hp R package. Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=gam.hp
Guisan, A., Edwards, T. C., & Hastie, T. (2002). Generalized linear and generalized additive models in studies of species distributions: Setting the scene. Ecological Modelling, 157(2–3), 89–100. https://doi.org/10.1016/S0304-3800(02)00204-1
Shalizi, C. R. (2012). Generalized linear models and generalized additive models (lecture notes, chapter 13). Department of Statistics, Carnegie Mellon University. http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/13/lecture-13.pdf
Hastie, T., & Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1(3), 297–310. http://www.jstor.org/stable/2245459
Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman & Hall/CRC.
Tragouda, K., Papadopoulos, T., & Stefanou, A. (2024). Identification of fraudulent financial statements through a multi-label classification approach. Intelligent Systems in Accounting, Finance and Management. https://doi.org/10.1002/isaf.225
White, L. F., Jiang, W., Ma, Y., So-Armah, K., Samet, J. H., & Cheng, D. M. (2020). Tutorial in biostatistics: The use of generalized additive models to evaluate alcohol consumption as an exposure variable. Drug and Alcohol Dependence, 209, 107944. https://doi.org/10.1016/j.drugalcdep.2020.107944
Wood, S. N. (2017). Generalized additive models: An introduction with R (2nd ed.). Chapman & Hall/CRC.
Wood, S. N. (2025). Mgcv: Mixed GAM computation vehicle with automatic smoothness estimation (R package version 1.9-1). Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=mgcv